Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add milvus vector db integration #419

Merged
merged 7 commits into from
Jan 3, 2025

Conversation

AbhishekRP2002
Copy link
Contributor

@AbhishekRP2002 AbhishekRP2002 commented Dec 19, 2024

  • initial draft PR for basic milvus vector db integration, aiming [Feature Request]: Add support for Milvus Vector Database #415
  • handled minor typos that crossed my eyes in backend/modules/vector_db/mongo.py
  • @chiragjn Can u please help me in understanding the test setup that you guys follow for testing the functionalities / unit testing ?
  • This PR includes a basic implementation of Milvus vector DB integration ( via Milvus Lite predominantly with Quick setup ref: https://milvus.io/docs/create-collection-instantly.md#Quick-Setup )
  • Post successful merge of the basic implementation we can scope out the integration of additional extended configs for Milvus like it's done for Qdrant and provide more flexibility with Customized setup, I am thinking.

Copy link
Contributor

@mnvsk97 mnvsk97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking time to add a new vector db integration and making corrections to the mongo file. Please have a look at the requested changes.

backend/modules/vector_db/milvus.py Outdated Show resolved Hide resolved
backend/modules/vector_db/milvus.py Outdated Show resolved Hide resolved
backend/modules/vector_db/milvus.py Outdated Show resolved Hide resolved
backend/modules/vector_db/milvus.py Show resolved Hide resolved
backend/modules/vector_db/milvus.py Outdated Show resolved Hide resolved
backend/modules/vector_db/milvus.py Outdated Show resolved Hide resolved
@mnvsk97
Copy link
Contributor

mnvsk97 commented Dec 22, 2024

@AbhishekRP2002 Please add an example config in the compose.env and comment so anyone who wants to use Milvus can refer to it.

@mnvsk97
Copy link
Contributor

mnvsk97 commented Dec 29, 2024

@AbhishekRP2002 PR looks good to me. Regarding your question about testing

Can u please help me in understanding the test setup that you guys follow for testing the functionalities / unit testing ?

We do not have a unit testing setup at the moment. Can you please post a few screenshots/videos showing that the basic operations work? If all looks good, we can go ahead and merge the PR.

@AbhishekRP2002
Copy link
Contributor Author

AbhishekRP2002 commented Dec 29, 2024

We do not have a unit testing setup at the moment. Can you please post a few screenshots/videos showing that the basic operations work? If all looks good, we can go ahead and merge the PR.

yes yes, I wrote a simple test.py script and encountered some bugs, I am fixing them and re-checking, will push the final changes in sometime.

@AbhishekRP2002
Copy link
Contributor Author

AbhishekRP2002 commented Dec 29, 2024

Screenshot 2024-12-29 at 5 39 10 PM `model_config.yaml` file I used ollama for the embedding model

Here is the test.py script I used to test the basic functionalities:

from langchain_openai import OpenAIEmbeddings  # noqa
from langchain_ollama import OllamaEmbeddings
from langchain.docstore.document import Document
from backend.types import VectorDBConfig
from backend.modules.vector_db.milvus import MilvusVectorDB
from backend.logger import logger
from rich.pretty import pprint


logger.setLevel("DEBUG")


def test_milvus_functionality():
    # Initialize configuration for local Milvus
    config = VectorDBConfig(
        provider="milvus",
        local=True,
        url="",
        api_key="",
        config={"metric_type": "COSINE"},
    )

    # Initialize Milvus client
    milvus_client = MilvusVectorDB(config)

    # Initialize embeddings (you'll need your OpenAI API key)
    embeddings = OllamaEmbeddings(model="nomic-embed-text")

    input_text = "The meaning of life is 42"
    vector = embeddings.embed_query(input_text)
    print(vector[:3])

    # Test collection name
    collection_name = "test_collection"

    # Create test documents
    test_documents = [
        Document(
            page_content="This is a test document 1",
            metadata={"_data_point_fqn": "test/doc1", "_data_point_hash": "hash1"},
        ),
        Document(
            page_content="This is a test document 2",
            metadata={"_data_point_fqn": "test/doc2", "_data_point_hash": "hash2"},
        ),
    ]

    try:
        print("Starting Milvus functionality test...")

        # Test 1: Create collection
        print("\n1. Testing collection creation...")
        milvus_client.create_collection(collection_name, embeddings)
        print("✓ Collection created successfully")

        # Test 2: List collections
        print("\n2. Testing list collections...")
        collections = milvus_client.get_collections()
        pprint(milvus_client.milvus_client.describe_collection(collection_name))
        pprint(milvus_client.milvus_client.get_collection_stats(collection_name))
        assert collection_name in collections
        print(f"✓ Collections listed successfully: {collections}")

        # Test 3: Upsert documents
        print("\n3. Testing document upsert...")
        milvus_client.upsert_documents(collection_name, test_documents, embeddings)
        print("✓ Documents upserted successfully")

        pprint(
            milvus_client.milvus_client.query(
                collection_name=collection_name,
                filter="",
                limit=2,
                output_fields=["id", "metadata", "text"],
            )
        )

        # Test 4: List data point vectors
        print("\n4. Testing list data point vectors...")
        vectors = milvus_client.list_data_point_vectors(collection_name, "test/doc1")
        assert len(vectors) > 0
        print(f"✓ Retrieved {len(vectors)} vectors successfully")

        # Test 5: Delete specific vectors
        print("\n5. Testing vector deletion...")
        milvus_client.delete_data_point_vectors(collection_name, vectors)
        print("✓ Vectors deleted successfully")

        # Test 6: Clean up - Delete collection
        print("\n6. Testing collection deletion...")
        milvus_client.delete_collection(collection_name)
        collections = milvus_client.get_collections()
        assert collection_name not in collections
        print("✓ Collection deleted successfully")

        print("\nAll tests completed successfully!")

    except Exception as e:
        print(f"\n Test failed: {str(e)}")
        try:
            if collection_name in milvus_client.get_collections():
                milvus_client.delete_collection(collection_name)
        except Exception as e:
            print(f"Failed to clean up after failure: {str(e)}")
            pass
        raise e


if __name__ == "__main__":
    test_milvus_functionality()

Here is the output screenshot:

Screenshot 2024-12-29 at 5 36 34 PM Screenshot 2024-12-29 at 5 35 42 PM

PS: since auto_id=True so even if there is no primary id or id field in the Document object, milvus will internally create a primary_id and assign respectively.

@mnvsk97
Copy link
Contributor

mnvsk97 commented Dec 30, 2024

Thanks for the detailed explanation @AbhishekRP2002. Please let me know once the PR is ready with the updated test.py and we can go ahead and merge. Also, please resolve the review comments.

@AbhishekRP2002
Copy link
Contributor Author

Thanks for the detailed explanation @AbhishekRP2002. Please let me know once the PR is ready with the updated test.py and we can go ahead and merge. Also, please resolve the review comments.

Resolved the review comments. Updated test.py file as in ? I think I cannot add it in the codebase right ? test.py is under git ignore.

@mnvsk97 mnvsk97 enabled auto-merge January 1, 2025 04:31
@mnvsk97
Copy link
Contributor

mnvsk97 commented Jan 1, 2025

LGTM 🚀

Thanks for the detailed explanation @AbhishekRP2002. Please let me know once the PR is ready with the updated test.py and we can go ahead and merge. Also, please resolve the review comments.

Resolved the review comments. Updated test.py file as in ? I think I cannot add it in the codebase right ? test.py is under git ignore.

I think its best if we resolve testing as a separate setup later on. Looks good for now.

@mnvsk97
Copy link
Contributor

mnvsk97 commented Jan 1, 2025

@AbhishekRP2002 Please resolve the precommit errors. To fix this issue, you can run something like pre-commit run --all-files.

@AbhishekRP2002
Copy link
Contributor Author

@AbhishekRP2002 Please resolve the precommit errors. To fix this issue, you can run something like pre-commit run --all-files.

Sure. Will do it.

auto-merge was automatically disabled January 1, 2025 14:32

Head branch was pushed to by a user without write access

@AbhishekRP2002
Copy link
Contributor Author

hi @mnvsk97 i have fixed the pre-commit errors.
and Happy New year to you, the team and your fam! have a great one

@mnvsk97 mnvsk97 enabled auto-merge January 3, 2025 05:03
@mnvsk97 mnvsk97 self-assigned this Jan 3, 2025
@mnvsk97 mnvsk97 merged commit 431d682 into truefoundry:main Jan 3, 2025
1 check passed
@mnvsk97
Copy link
Contributor

mnvsk97 commented Jan 3, 2025

Merged. Thanks for the contribution. Happy new year !

Copy link

@jinhonglin-ryan jinhonglin-ryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is Jinhong from Milvus, based on the existing milvus integration for cognita, I have added a few more functionalities! Feel free to let me know what you think!

metric_type=self.metric_type, # https://milvus.io/docs/metric.md#Metric-Types : check for other supported metrics
schema=schema,
auto_id=True,
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is Jinhong from Milvus. We have transitioned to the new MilvusClient interface and are actively working to phase out the use of the older ORM interface. In this context, your current approach to adding fields to a collection may be updated to the following:

schema = self.milvus_client.create_schema(
    auto_id=False, enable_dynamic_field=True
)

schema.add_field(
    field_name="id",
    datatype=DataType.INT64,
    is_primary=True,
)

schema.add_field(
    field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=vector_size
)

schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=65535,
)

schema.add_field(
    field_name="metadata",
    datatype=DataType.JSON,
)

index_params = self.milvus_client.prepare_index_params()

index_params.add_index(
    field_name="vector",
    index_type="FLAT",
    metric_type=self.metric_type,
)

self.milvus_client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params,
)

collection_name=collection_name, filter=delete_expr
)

logger.debug(f"[Milvus] Deleted {len(data_point_vectors)} data point vectors")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding the following 3 functions: list_documents_in_collection, delete_documents, and list_document_vector_points Here is my implementation based on your code:

def list_documents_in_collection(
    self, collection_name: str, base_document_id: str = None
) -> List[str]:
    """
    List all documents in a collection.
    Args:
        collection_name (str): The name of the collection.
        base_document_id (str, optional): Base document ID for filtering. Defaults to None.
    Returns:
        List[str]: List of document IDs.
    """
    logger.debug(
        f"[Milvus] Listing all documents with base document ID {base_document_id} for collection {collection_name}"
    )

    stop = False
    offset = 0
    document_ids_set = set()

    while not stop:
        filter_expr = (
            f'metadata["{DATA_POINT_FQN_METADATA_KEY}"] == "{base_document_id}"'
            if base_document_id
            else ""
        )

        search_result = self.milvus_client.query(
            collection_name=collection_name,
            filter=filter_expr,
            output_fields=["metadata"],
            limit=BATCH_SIZE,
            offset=offset,
        )

        if not search_result:
            stop = True
            break

        for doc in search_result:
            metadata = doc.get("metadata", {})
            if metadata.get(DATA_POINT_FQN_METADATA_KEY):
                document_ids_set.add(metadata.get(DATA_POINT_FQN_METADATA_KEY))
            if len(document_ids_set) > MAX_SCROLL_LIMIT:
                stop = True
                break

        if len(search_result) < BATCH_SIZE:
            stop = True
        else:
            offset += BATCH_SIZE

    logger.debug(
        f"[Milvus] Found {len(document_ids_set)} documents with base document ID {base_document_id} in collection {collection_name}"
    )
    return list(document_ids_set)
def delete_documents(self, collection_name: str, document_ids: List[str]):
    """
    Delete documents from a collection based on document IDs.
    Args:
        collection_name (str): The name of the collection.
        document_ids (List[str]): List of document IDs to delete.
    """
    logger.debug(
        f"[Milvus] Deleting {len(document_ids)} documents from collection {collection_name}"
    )

    if not document_ids:
        logger.warning("[Milvus] No document IDs provided for deletion.")
        return

    try:
        for i in range(0, len(document_ids), BATCH_SIZE):
            document_ids_to_delete = document_ids[i : i + BATCH_SIZE]

            delete_expr = " or ".join(
                [
                    f'metadata["{DATA_POINT_FQN_METADATA_KEY}"] == "{doc_id}"'
                    for doc_id in document_ids_to_delete
                ]
            )

            self.milvus_client.delete(
                collection_name=collection_name, filter=delete_expr
            )

        logger.debug(
            f"[Milvus] Deleted {len(document_ids)} documents from collection {collection_name}"
        )
    except Exception as exp:
        logger.error(f"[Milvus] Error deleting documents: {exp}")
def list_document_vector_points(
    self, collection_name: str
) -> List[DataPointVector]:
    """
    List all document vector points in a collection.
    Args:
        collection_name (str): The name of the collection.
    Returns:
        List[DataPointVector]: List of vector points with metadata.
    """
    logger.debug(
        f"[Milvus] Listing all document vector points for collection {collection_name}"
    )

    stop = False
    offset = 0
    document_vector_points: List[DataPointVector] = []

    while not stop:
        search_result = self.milvus_client.query(
            collection_name=collection_name,
            output_fields=["id", "metadata"],
            limit=BATCH_SIZE,
            offset=offset,
        )

        if not search_result:
            stop = True
            break

        for doc in search_result:
            metadata = doc.get("metadata", {})
            if metadata.get(DATA_POINT_FQN_METADATA_KEY) and metadata.get(
                DATA_POINT_HASH_METADATA_KEY
            ):
                document_vector_points.append(
                    DataPointVector(
                        data_point_vector_id=str(doc["id"]),
                        data_point_fqn=metadata.get(DATA_POINT_FQN_METADATA_KEY),
                        data_point_hash=metadata.get(DATA_POINT_HASH_METADATA_KEY),
                    )
                )
            if len(document_vector_points) > MAX_SCROLL_LIMIT:
                stop = True
                break

        if len(search_result) < BATCH_SIZE:
            stop = True
        else:
            offset += BATCH_SIZE

    logger.debug(
        f"[Milvus] Listed {len(document_vector_points)} document vector points for collection {collection_name}"
    )
    return document_vector_points

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants